Visualization of Large Complex Data

Steve Elston

09/11/2023

Review: Why is Perception Important?

Use Aesthetics to Improve Perception

Regression Lines

Regression lines draw the viewer's attention

Properties of Common Aesthetics

Property or Aesthetic        Perception   Data Types
Aspect ratio                 Good         Numeric
Regression lines             Good         Numeric plus categorical
Marker position              Good         Numeric
Bar length                   Good         Counts, numeric
Sequential color palette     Moderate     Numeric, ordered categorical
Marker size                  Moderate     Numeric, ordered categorical
Line types                   Limited      Categorical
Qualitative color palette    Limited      Categorical
Marker shape                 Limited      Categorical
Area                         Limited      Numeric or categorical
Angle                        Limited      Numeric

Visualizing Large Complex Data is Difficult

Problem: Modern data sets are growing in size and complexity

Note: We will address the use of dimensionality reduction techniques in another lesson, time permitting

Limitation of Scientific Graphics

All scientific graphics are limited to a 2-dimensional projection

Approaches to display of complex data relationships

Generally combine multiple methods to effectively display complex data

Scalable Chart Types

Some chart types are inherently scalable.

Over-plotting

Over-plotting occurs when markers lie on top of one another.

Dealing with Over-plotting

What can we do about over-plotting?

Example of Over-plotting

Use Transparency, Marker Size, Downsampling

Downsample to 20%, alpha = 0.1, size = 2
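
A minimal sketch of the three remedies applied together, assuming matplotlib, NumPy and pandas are available. The data are synthetic and the column names 'x' and 'y' are hypothetical stand-ins; only the 20% sample fraction, alpha = 0.1 and size = 2 come from the slide.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for a large data set (hypothetical columns 'x' and 'y')
rng = np.random.default_rng(seed=42)
n = 50_000
df = pd.DataFrame({'x': rng.normal(size=n)})
df['y'] = 0.5 * df['x'] + rng.normal(scale=0.5, size=n)

# Downsample to 20% of the rows; a fixed random_state makes the sample repeatable
sample = df.sample(frac=0.2, random_state=42)

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)

# Left panel: every point with default markers, heavily over-plotted
axes[0].scatter(df['x'], df['y'])
axes[0].set_title('All points, default markers')

# Right panel: 20% sample with small, highly transparent markers
axes[1].scatter(sample['x'], sample['y'], s=2, alpha=0.1)
axes[1].set_title('20% sample, alpha = 0.1, size = 2')

for ax in axes:
    ax.set_xlabel('x')
axes[0].set_ylabel('y')
# Call plt.show() to display the figure in an interactive session
```

With transparency, dense regions appear darker because many faint markers overlap, so the right panel reveals structure that the left panel hides.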

Other Methods to Display Large Data Sets

Alternatives to avoid over-plotting for truly large data sets

Hexbin Plot

Example: Density of sale price by time
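
A hedged sketch of a hexbin plot, assuming matplotlib and NumPy. The sale price and time values below are synthetic placeholders, not the housing data used for the figure in this lesson, and the grid size and color map are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for sale price vs. time (hypothetical values)
rng = np.random.default_rng(seed=0)
n = 100_000
time = rng.uniform(2008, 2016, size=n)
log_price = 4.5 + 0.03 * (time - 2008) + rng.normal(scale=0.4, size=n)

fig, ax = plt.subplots(figsize=(8, 5))
# Each hexagon is colored by the number of observations that fall inside it;
# mincnt=1 leaves empty hexagons blank
hb = ax.hexbin(time, log_price, gridsize=40, cmap='Blues', mincnt=1)
fig.colorbar(hb, ax=ax, label='Count')
ax.set_xlabel('Time (decimal year)')
ax.set_ylabel('Log sale price')
# Call plt.show() to display the figure in an interactive session
```

Because the plot draws one hexagon per bin rather than one marker per observation, the rendering cost and the visual clutter do not grow with the number of rows.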

Contour Plot

Example: Contour plot of 2-D KDE of sale price vs. time
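
One way to build such a plot is sketched below, assuming SciPy's gaussian_kde for the density estimate and matplotlib for the contours. The data are synthetic, and the 60 x 60 evaluation grid and 8 contour levels are arbitrary illustrative choices; the original figure may have been produced differently.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Synthetic stand-in for sale price vs. time (hypothetical values)
rng = np.random.default_rng(seed=1)
n = 5_000
time = rng.uniform(2008, 2016, size=n)
log_price = 4.5 + 0.03 * (time - 2008) + rng.normal(scale=0.4, size=n)

# Fit a 2-D Gaussian KDE; gaussian_kde expects an array of shape (n_dims, n_points)
kde = gaussian_kde(np.vstack([time, log_price]))

# Evaluate the estimated density on a regular grid
tt, pp = np.meshgrid(np.linspace(time.min(), time.max(), 60),
                     np.linspace(log_price.min(), log_price.max(), 60))
density = kde(np.vstack([tt.ravel(), pp.ravel()])).reshape(tt.shape)

fig, ax = plt.subplots(figsize=(8, 5))
# Contour lines join grid points of equal estimated density
cs = ax.contour(tt, pp, density, levels=8, cmap='viridis')
ax.set_xlabel('Time (decimal year)')
ax.set_ylabel('Log sale price')
# Call plt.show() to display the figure in an interactive session
```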

Heat Map

Example: Airline passenger counts by month and year displayed as a sequential heat map
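
A rough sketch of the steps involved, assuming pandas and matplotlib: reshape long-format counts into a month-by-year table, then map each cell to a sequential color palette. The passenger counts are synthetic and the column names 'year', 'month' and 'passengers' are hypothetical.

```python
import calendar
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic monthly counts with a trend and a seasonal bump (hypothetical data)
rng = np.random.default_rng(seed=2)
years = list(range(2001, 2013))
months = list(calendar.month_abbr[1:])
rows = []
for y in years:
    for i, m in enumerate(months):
        seasonal = 30 * np.sin(np.pi * i / 11)   # highest around mid-year
        count = int(100 + 25 * (y - years[0]) + seasonal + rng.integers(0, 10))
        rows.append({'year': y, 'month': m, 'passengers': count})
long_df = pd.DataFrame(rows)

# Reshape long -> wide: months as rows (calendar order), years as columns
wide = long_df.pivot(index='month', columns='year', values='passengers').reindex(months)

fig, ax = plt.subplots(figsize=(8, 5))
# A sequential palette ('Blues') maps larger counts to darker cells
im = ax.imshow(wide.to_numpy(), cmap='Blues', aspect='auto')
ax.set_xticks(range(len(wide.columns)))
ax.set_xticklabels(wide.columns)
ax.set_yticks(range(len(wide.index)))
ax.set_yticklabels(wide.index)
fig.colorbar(im, ax=ax, label='Passengers')
# Call plt.show() to display the figure in an interactive session
```

The heat map always has one cell per month and year combination, so its size depends on the number of categories rather than on the number of underlying records.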

Heat map of

Mosaic Plots

How can we display multidimensional count (categorical) data at scale?

Mosaic Plots

Example: How do counts change with working vs. non-working day, weather and year?

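## Imports required by the mosaic plot
import matplotlib.pyplot as plt
from statsmodels.graphics import mosaicplot

## categorical_cols is assumed to hold the names of the categorical columns to display
plt.rcParams.update({'font.size': 18})
fig, ax = plt.subplots(figsize=(25, 20))
_ = mosaicplot.mosaic(bike_share_df.loc[:, categorical_cols],
                      index=list(categorical_cols),
                      title='Counts of weather conditions',
                      ax=ax)

plt.show()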

Mosaic Plots

Facet plot of wind by month

Other Methods to Display Large Data Sets

Sometimes a creative alternative is best

Time Series of Box Plots

Example: Time ordered box plots of quarterly sales price
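
A minimal sketch of this idea, assuming pandas and matplotlib: group the observations by calendar quarter and draw one box per quarter in chronological order. The sale records are synthetic and the column names 'date' and 'log_price' are hypothetical.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic sale records spread over three years (hypothetical data)
rng = np.random.default_rng(seed=3)
offsets = pd.to_timedelta(rng.integers(0, 3 * 365, size=3_000), unit='D')
sales = pd.DataFrame({'date': pd.Timestamp('2010-01-01') + offsets})
elapsed_days = (sales['date'] - sales['date'].min()).dt.days
sales['log_price'] = 4.5 + 0.0002 * elapsed_days + rng.normal(scale=0.3, size=len(sales))

# Assign each sale to its calendar quarter
sales['quarter'] = sales['date'].dt.to_period('Q')

# Sorting Period values puts the quarters in chronological order
quarters = sorted(sales['quarter'].unique())
groups = [sales.loc[sales['quarter'] == q, 'log_price'].to_numpy() for q in quarters]

fig, ax = plt.subplots(figsize=(10, 4))
# One box per quarter summarizes the distribution of prices in that period
ax.boxplot(groups)
ax.set_xticks(range(1, len(quarters) + 1))
ax.set_xticklabels([str(q) for q in quarters], rotation=90)
ax.set_xlabel('Quarter')
ax.set_ylabel('Log sale price')
# Call plt.show() to display the figure in an interactive session
```

Each box replaces hundreds of markers with a five-number summary, so the display stays readable however many sales fall in a quarter.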

Displays for Complex Data

How can we understand the relationships in complex high-dimensional data with many variables?

Arrays of Plots

Display multiple plot views in an array or grid
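
As a small illustration, the sketch below lays out four histograms on a 2 x 2 grid with matplotlib. The four variables are synthetic and their names are hypothetical; any set of related views could take their place.

```python
import numpy as np
import matplotlib.pyplot as plt

# Four synthetic variables to display side by side (hypothetical data)
rng = np.random.default_rng(seed=4)
variables = {
    'normal': rng.normal(size=2_000),
    'uniform': rng.uniform(size=2_000),
    'exponential': rng.exponential(size=2_000),
    'lognormal': rng.lognormal(size=2_000),
}

# plt.subplots returns a 2-D array of axes, one per grid cell
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for ax, (name, values) in zip(axes.flat, variables.items()):
    ax.hist(values, bins=30, alpha=0.5)
    ax.set_title(name)
fig.tight_layout()
# Call plt.show() to display the figure in an interactive session
```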

Scatter Plot Matrix

Scatter plot matrix used to investigate relationships between a number of variables

Scatter Plot Matrix

A scatter plot matrix creates a two-dimensional array of plots of variable pairs
- Upper triangular plots: Scatter plots with regression lines
- Lower triangular plots: Hexbin plots of density
- Diagonal plots: Histograms of variables

import matplotlib.pyplot as plt
import seaborn as sns

## Grid of pairwise plots, colored by sex
g = sns.PairGrid(diabetes.drop('Sex', axis=1), hue='sex_categorical', palette="Set2", height=1.5)
_=g.map_upper(sns.regplot, order=1, truncate=True, scatter_kws={'s':0.5})
_=g.map_lower(plt.hexbin, alpha=0.5, cmap='Blues', gridsize=15, linewidths=0)
_=g.map_diag(plt.hist, histtype="step",linewidth=1)
plt.show()

Scatter Plot Matrix

Scatterplot matrix with different plot types

Facet Plots

Facet plots revolutionized statistical graphics starting about 30 years ago

Facet Plots

Like many good ideas, facet plotting was invented several times

Facet Plot with wind speed by Month

Facet plot projected on a grid

import calendar
## One histogram panel per month, wrapped into rows of four
g = sns.FacetGrid(bike_share_df, col='month', col_order=calendar.month_abbr[1:], col_wrap=4, height=5)
g = g.map(plt.hist, "windspeed", bins=20, color="b", alpha=0.5)

Facet Plot with wind speed by Month

Facet plot of wind by month

Facet Plot of Hourly Counts by Weather and Season

Example: Plot count of riders by hour, conditioned on weather and season

g = sns.FacetGrid(bike_share_df, col="Weather", col_order=weather.values(), row="Season", height=2, aspect=2)
g = g.map(sns.scatterplot, "hr", "cnt", s=3, alpha=0.2)
for ax in g.axes.flat:
    ax.set_title(ax.get_title(), fontsize=10)

Facet Plot of Hourly Counts by Weather and Season

Example: Plot count of riders by hour, conditioned on weather and season

Facet plot of count by season and weather

Facet Plot of Hourly Counts by Weather and Season

Example: Plot count of riders by hour, conditioned on weather and season

g = sns.FacetGrid(bike_share_df, col="Weather", col_order=weather.values(), row="Season", height=2, aspect=2)
g = g.map(sns.boxplot, "hr", 'cnt', color='lightgray', order=None)
for ax in g.axes.flat:
    ax.set_title(ax.get_title(), fontsize=10)
for ax in g.axes[:,0]:
    ax.set_ylabel('Count')

Facet Plot of Hourly Counts by Weather and Season

Example: Box plots of rider counts by hour, conditioned on weather and season

Facet plot of count by season and weather

Cognostics

How can we visualize very high dimensional data?

Cognostic: States With Fastest Rate of Housing Price Increase

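import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm

## Add an intercept column to the data frame
housing.loc[:,'intercept'] = [1.0] * housing.shape[0]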
## Demean the decimal time column
mean_time = housing.loc[:,'time_decimal'].mean()
housing.loc[:,'time_demean'] = housing.loc[:,'time_decimal'].subtract(mean_time)

## Subset one group and standardize its log price (zero mean, unit standard deviation)
def prepare_temp(df, group_value, group_variable = 'state'):
    temp = df.loc[df.loc[:,group_variable]==group_value,:].copy()
    mean_price = np.mean(temp.loc[:,'log_medSoldPriceSqft'])
    temp.loc[:,'log_medSoldPriceSqft'] = np.subtract(temp.loc[:,'log_medSoldPriceSqft'], mean_price)
    std_price = np.std(temp.loc[:,'log_medSoldPriceSqft'])
    temp.loc[:,'log_medSoldPriceSqft'] = np.divide(temp.loc[:,'log_medSoldPriceSqft'], std_price)
    return temp, mean_price, std_price

## Find the slope coefficient of standardized log price vs. time for each group
def compute_slopes(df, column, group_variable='state'):
    slopes = []
    entities = []
    intercepts = []
    for e in df.loc[:,column].unique():
        temp, mean_price, std_price = prepare_temp(df, e, group_variable=column)
        temp_OLS = sm.OLS(temp.loc[:,'log_medSoldPriceSqft'],temp.loc[:,['intercept','time_demean']]).fit()
        slopes.append(temp_OLS.params.time_demean)
        intercepts.append(temp_OLS.params.intercept)
        entities.append(e)
 
    slopes_df = pd.DataFrame({'slopes':slopes, 'intercept_coef':intercepts, 'entity_name':entities})    
    slopes_df.sort_values(by='slopes', ascending=False, inplace=True)
    slopes_df.reset_index(inplace=True, drop=True) 
    return slopes_df

## Rank the states by fitted slope, steepest first
state_slopes = compute_slopes(housing, 'state')

## Select the states with the fastest growing prices
def find_changes(df, slopes, start, end, col='state'):
    ## .loc slicing is inclusive, so rows start through end are all selected
    increase = slopes.loc[start:end,'entity_name']
    increase_df = df.loc[df.loc[:,col].isin(increase),:]
    increase_df = increase_df.merge(slopes, how='left', right_on='entity_name', left_on=col)
    return increase_df, increase
big_increase_states, increase_states = find_changes(housing, state_slopes, 0, 7)

## Display scatter plots of log price vs. time, one panel per entity
def plot_price_by_entity(df, order, entity='state', xlims=(2007.5, 2016.5)):
    g = sns.FacetGrid(df, col=entity, col_wrap=4, height=2, col_order=order)
    g = g.map(sns.regplot, 'time_decimal', 'log_medSoldPriceSqft', 
              line_kws={'color':'red'}, scatter_kws={'alpha': 0.1, 's':0.5})
    g.set(xlim=(xlims[0],xlims[1]))
    plt.show()

plot_price_by_entity(big_increase_states, increase_states)

Cognostic: States With Fastest Rate of Housing Price Increase

States with greatest increase in housing price

Summary

We have explored these key points

Summary

Generally combine multiple methods to effectively display complex data